SceneSeer: 3D Scene Design with Natural Language
Authors
Abstract
Designing 3D scenes is currently a creative task that requires significant expertise and effort in using complex 3D design interfaces. This effortful design process stands in stark contrast to the ease with which people can use language to describe real and imaginary environments. We present SCENESEER: an interactive text to 3D scene generation system that allows a user to design 3D scenes using natural language. A user provides input text from which we extract explicit constraints on the objects that should appear in the scene. Given these explicit constraints, the system then uses a spatial knowledge base learned from an existing database of 3D scenes and 3D object models to infer an arrangement of the objects forming a natural scene matching the input description. Using textual commands, the user can then iteratively refine the created scene by adding, removing, replacing, and manipulating objects. We evaluate the quality of 3D scenes generated by SCENESEER in a perceptual evaluation experiment where we compare against manually designed scenes and simpler baselines for 3D scene generation. We demonstrate how the generated scenes can be iteratively refined through simple natural language commands.

INTRODUCTION

Designing 3D scenes is a challenging creative task. Expert users expend considerable effort in learning how to use complex 3D scene design tools. Still, immense manual effort is required, leading to high costs for producing 3D content in video games, films, interior design, and architectural visualization. Despite the conceptual simplicity of generating pictures from descriptions, systems for text-to-scene generation have only achieved limited success. How might we allow people to create 3D scenes using simple natural language?

Current 3D design tools provide a great amount of control over the construction and precise positioning of geometry within 3D scenes. However, most of these tools do not allow for intuitively assembling a scene from existing objects, which is critical for non-professional users. As an analogy, in real life few people are carpenters, but most of us have bought and arranged furniture. For the purposes of defining how to compose and arrange objects into scenes, natural language is an obvious interface. It is much easier to say “Put a blue bowl on the dining table” than to retrieve, insert, and orient a 3D model of a bowl. Text to 3D scene interfaces can empower a broader demographic to create 3D scenes for games, interior design, and virtual storyboarding.

Text to 3D scene systems face several technical challenges. Firstly, natural language is typically terse and incomplete. People rarely mention many facts about the world since these facts can often be safely assumed. Most desks are upright and on the floor, but few people would mention this explicitly. This implicit spatial knowledge is critical for scene generation but hard to extract. Secondly, people reason about the world at a much higher level than typical representations of 3D scenes (using the descriptive phrase “table against wall” vs. a 3D transformation matrix). The semantics of objects and their approximate arrangement are typically more important than the precise and abstract properties of geometry. Most 3D scene design tools grew out of the traditions of Computer Aided Design and architecture, where precision of control and specification is much more important than for casual users.
Traditional interfaces allow for comprehensive control but are typically not concerned with high-level semantics. SCENESEER allows users to generate and manipulate 3D scenes at the level of everyday semantics through simple natural language. It leverages spatial knowledge priors learned from existing 3D scene data to infer implicit, unmentioned constraints and resolve view-dependent spatial relations in a natural way. For instance, given the sentence “there is a dining table with a cake”, we can infer that the cake is most likely on a plate and that the plate is most likely on the table. This elevation of 3D scene design to the level of everyday semantics is critical for enabling intuitive design interfaces, rapid prototyping, and coarse-to-fine refinements.

In this paper, we present a framework for the text to 3D scene task and use it to motivate the design of the SCENESEER system. We demonstrate that SCENESEER can be used to generate 3D scenes from terse, natural language descriptions. We empirically evaluate the quality of the generated scenes with a human judgment experiment and find that SCENESEER can generate high-quality scenes matching the input text. We show how textual commands can be used interactively in SCENESEER to manipulate generated 3D scenes.
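To make the inference of implicit constraints concrete, the sketch below illustrates how a learned support prior could complete the constraints for “there is a dining table with a cake”. It is a minimal sketch under assumed data, not the SCENESEER implementation; the category names, prior probabilities, and function names are hypothetical.

```python
# A minimal sketch, assuming a support-prior table of the kind learned from
# 3D scene data; it is NOT the authors' implementation, and all category
# names, probability values, and function names here are hypothetical.
from collections import namedtuple

Constraint = namedtuple("Constraint", ["child", "relation", "parent"])

# Hypothetical priors P(supporting parent category | child category), as might
# be estimated by counting static support relations in a 3D scene database.
SUPPORT_PRIORS = {
    "cake":  {"plate": 0.8, "table": 0.2},
    "plate": {"table": 0.9, "desk": 0.1},
    "table": {"floor": 1.0},
}

def most_likely_parent(category):
    """Return the most probable supporting category for an object, if known."""
    priors = SUPPORT_PRIORS.get(category, {})
    return max(priors, key=priors.get) if priors else None

def infer_implicit_constraints(objects, explicit_constraints):
    """Add support constraints for objects whose support was never stated,
    pulling in unmentioned support objects (e.g. the plate under the cake)."""
    supported = {c.child for c in explicit_constraints}
    constraints = list(explicit_constraints)
    frontier = [o for o in objects if o not in supported]
    while frontier:
        child = frontier.pop()
        parent = most_likely_parent(child)
        if parent is None:
            continue
        constraints.append(Constraint(child, "supported_by", parent))
        if parent not in objects:           # an implicit, unmentioned object
            objects.append(parent)
            frontier.append(parent)
    return objects, constraints

if __name__ == "__main__":
    # "there is a dining table with a cake": only the table and cake are explicit.
    objects, constraints = infer_implicit_constraints(["table", "cake"], [])
    for c in constraints:
        print(c.child, c.relation, c.parent)
    # Under the toy priors above this prints:
    #   cake supported_by plate
    #   plate supported_by table
    #   table supported_by floor
```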
Similar Resources
Text to 3D Scene Generation: A Dissertation Submitted to the Department of Computer Science and the Committee on Graduate Studies of Stanford University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
The ability to form a visual interpretation of the world from natural language is pivotal to human communication. Similarly, from a computational perspective, mapping descriptions of scenes to 3D geometric representations is useful in many areas such as robotics, interior design and even education. Text to 3D scene generation is a task which addresses this problem space. A user provides natural...
Learning Spatial Knowledge for Text to 3D Scene Generation
We address the grounding of natural language to concrete spatial constraints, and inference of implicit pragmatics in 3D environments. We apply our approach to the task of text-to-3D scene generation. We present a representation for common sense spatial knowledge and an approach to extract it from 3D scene data. In text-to3D scene generation, a user provides as input natural language text from ...
Studying the Role of Location in 3D Scene Description using Natural Language
In this paper, the description of 3D indoor scenes in natural language is studied from the point of view of intrinsic and relative location of the objects. An approach has been developed for this purpose which uses an Xbox 360 Kinect in combination with ROS and PCL to obtain 3D data from the scene. Object features are computed on these 3D data, which are used to generate an SVM model which classif...
A Language Visualization System
In this thesis, a novel language visualization system is presented that converts natural language text into 3D scenes. The system is capable of understanding some concrete nouns, visualizable adjectives and spatial prepositions from full natural language sentences and generating 3D static scenes using these sentences. It is a rule-based system that uses natural language processing tools, 3D mod...
Interactive Learning of Spatial Knowledge for Text to 3D Scene Generation
We present an interactive text to 3D scene generation system that learns the expected spatial layout of objects from data. A user provides input natural language text from which we extract explicit constraints on the objects that should appear in the scene. Given these explicit constraints, the system then uses prior observations of spatial arrangements in a database of scenes to infer the most...
Journal: CoRR
Volume: abs/1703.00050
Pages: -
Publication year: 2017